library(tidyverse)
library(gapminder)

The following, (df <- gapminder), is shorthand for
df <- gapminder
df

(df <- gapminder)

First, we should know about the variables. At the very least, you must know whether each variable is categorical or numerical.
For example, in the gapminder data,
country, continent are categorical variables,
and year, lifeExp, pop,
gdpPercap are numerical variables. It is possible to treat
year as a categorical variable.
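As a quick check of the classification above, here is a minimal sketch that lists the class of each gapminder column and converts year to a factor. The names var_classes and year_as_factor are chosen here only for illustration.

```r
library(gapminder)

# Class of each column: "factor" columns are categorical,
# "integer" and "numeric" columns are numerical.
var_classes <- sapply(gapminder, function(x) class(x)[1])
print(var_classes)

# year is stored as an integer, but it takes only a few distinct values,
# so it can also be converted to a factor and used as a category.
year_as_factor <- factor(gapminder$year)
nlevels(year_as_factor)
```

Wrapping year in factor() inside aes() has the same effect for a single plot, without changing the data.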
datasets::CO2

You can obtain basic information about the data with the following, or by typing CO2 in the search box under the Help tab. You can see the same at: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html
help(CO2) # or ? CO2

An object of class c("nfnGroupedData", "nfGroupedData", "groupedData", "data.frame") containing the following columns:
Plant: an ordered factor with levels Qn1 < Qn2 < Qn3 < … < Mc1 giving a unique identifier for each plant.
Type: a factor with levels Quebec Mississippi giving the origin of the plant
Treatment: a factor with levels nonchilled chilled
conc: a numeric vector of ambient carbon dioxide concentrations (mL/L).
uptake: a numeric vector of carbon dioxide uptake rates (μmol/m^2 sec).
df_co2 <- as_tibble(datasets::CO2) # what happens if you simply use `df_co2 <- datasets::CO2`?
df_co2

If you set df_co2 <- CO2 or df_co2 <- datasets::CO2 instead, every row is printed; use head(df_co2) to see only the first few rows.
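To see what as_tibble() changes, the following sketch compares the classes of the two versions. The names co2_raw and co2_tbl are introduced here only for illustration.

```r
library(tibble)

co2_raw <- datasets::CO2             # the object as shipped with R
co2_tbl <- as_tibble(datasets::CO2)  # converted to a tibble

class(co2_raw)  # includes "groupedData" and "data.frame"
class(co2_tbl)  # tibble classes, ending in "data.frame"

# A plain data frame prints every row, so head() keeps the output short.
head(co2_raw)

# A tibble prints only the first rows by default.
co2_tbl
```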
glimpse(df_co2)
Rows: 84
Columns: 5
$ Plant <ord> Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn1, Qn2, Qn2, Qn2, Qn2, Qn2, Qn2, Qn2, Qn3, Qn…
$ Type <fct> Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebec, Quebe…
$ Treatment <fct> nonchilled, nonchilled, nonchilled, nonchilled, nonchilled, nonchilled, nonch…
$ conc <dbl> 95, 175, 250, 350, 500, 675, 1000, 95, 175, 250, 350, 500, 675, 1000, 95, 175…
$ uptake <dbl> 16.0, 30.4, 34.8, 37.2, 35.3, 39.2, 39.7, 13.6, 27.3, 37.1, 41.8, 40.6, 41.4,…
A “factor” holds categorical data, and a “double” holds numerical data.
class(df_co2$Plant)
[1] "ordered" "factor"
class(df_co2$Type)
[1] "factor"
class(df_co2$Treatment)
[1] "factor"
class(df_co2$conc)
[1] "numeric"
class(df_co2$uptake)
[1] "numeric"
summary(df_co2)
     Plant             Type         Treatment       conc          uptake
 Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70
 Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90
 Qn3    : 7                                    Median : 350   Median :28.30
 Qc1    : 7                                    Mean   : 435   Mean   :27.21
 Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12
 Qc2    : 7                                    Max.   :1000   Max.   :45.50
 (Other):42
Try various visualizations first. Then you can choose the appropriate ones later in your research.
df_co2 %>% ggplot(aes(x = conc)) + geom_histogram()
df_co2 %>% ggplot(aes(x = uptake)) + geom_histogram()
df_co2 %>% ggplot(aes(x = factor(conc), y = uptake)) + geom_boxplot()
df_co2 %>% ggplot(aes(x = factor(conc), y = uptake, fill = Type)) + geom_boxplot()
df_co2 %>% ggplot(aes(x = factor(conc), y = uptake, fill = Treatment)) + geom_boxplot()
df_co2 %>% ggplot(aes(x = Plant, y = uptake, fill = Treatment)) + geom_boxplot()
df_co2 %>% ggplot(aes(x = Plant, y = uptake, fill = Type)) + geom_boxplot()
df_co2 %>% ggplot(aes(x = Plant, y = uptake, fill = Type, color = Treatment)) + geom_boxplot()

What can you see? Write your observations.
df_co2 %>% ggplot(aes(x = conc, y = uptake, color = Treatment)) + geom_point()
df_co2 %>% ggplot(aes(x = Plant, y = Type, color = Treatment, size = conc)) + geom_point()
df_co2 %>% ggplot(aes(x = Plant, y = Type, size = conc, shape = Treatment)) + geom_point() + facet_wrap(vars(Treatment))
ggplot(data = df_co2) +
  geom_point(aes(x = conc, y = uptake))

The following prints a vector.
df_co2 %>% distinct(conc) %>% pull()
[1]   95  175  250  350  500  675 1000

The following code generates a data frame.
df_co2 %>% distinct(conc)
ggplot(data = CO2) +
  geom_line(aes(x = conc, y = uptake))

The code above does not give a useful chart. A line graph is not appropriate in this case, because there are many uptake values at the same conc.
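One possible way around this, sketched here as an assumption rather than as the intended solution, is to reduce the data to one value per conc (for example the mean uptake) before drawing a line. The names co2_mean, mean_uptake and p_mean are chosen only for illustration.

```r
library(dplyr)
library(ggplot2)

# One row per concentration: how many observations share it, and their mean uptake.
co2_mean <- tibble::as_tibble(datasets::CO2) %>%
  group_by(conc) %>%
  summarize(n = n(), mean_uptake = mean(uptake), .groups = "drop")

co2_mean

# With a single y value per x value, a line graph becomes readable.
p_mean <- ggplot(co2_mean, aes(x = conc, y = mean_uptake)) +
  geom_line() +
  geom_point()
p_mean
```

The column n shows how many uptake values were collapsed at each concentration, which is why the raw line graph zigzags.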
datasets::Seatbelts

Search for information about the data.
Road Casualties in Great Britain 1969–84
References Harvey, A. C. and Durbin, J. (1986). The effects of seat belt legislation on British road casualties: A case study in structural time series modelling. Journal of the Royal Statistical Society series A, 149, 187–227. doi:10.2307/2981553.
The paper is available if you log in to ICU Library > E-Databases > JSTOR.
Can you see the difference between the following two pieces of code?
head(Seatbelts)
 DriversKilled drivers front rear kms PetrolPrice VanKilled law
[1,] 107 1687 867 269 9059 0.1029718 12 0
[2,] 97 1508 825 265 7685 0.1023630 6 0
[3,] 102 1507 806 319 9963 0.1020625 12 0
[4,] 87 1385 814 407 10955 0.1008733 8 0
[5,] 119 1632 991 454 11823 0.1010197 10 0
[6,] 106 1511 945 427 12391 0.1005812 13 0
df_sb <- as_tibble(datasets::Seatbelts)
df_sb
summary(df_sb)
 DriversKilled drivers front rear kms
Min. : 60.0 Min. :1057 Min. : 426.0 Min. :224.0 Min. : 7685
1st Qu.:104.8 1st Qu.:1462 1st Qu.: 715.5 1st Qu.:344.8 1st Qu.:12685
Median :118.5 Median :1631 Median : 828.5 Median :401.5 Median :14987
Mean :122.8 Mean :1670 Mean : 837.2 Mean :401.2 Mean :14994
3rd Qu.:138.0 3rd Qu.:1851 3rd Qu.: 950.8 3rd Qu.:456.2 3rd Qu.:17202
Max. :198.0 Max. :2654 Max. :1299.0 Max. :646.0 Max. :21626
PetrolPrice VanKilled law
Min. :0.08118 Min. : 2.000 Min. :0.0000
1st Qu.:0.09258 1st Qu.: 6.000 1st Qu.:0.0000
Median :0.10448 Median : 8.000 Median :0.0000
Mean :0.10362 Mean : 9.057 Mean :0.1198
3rd Qu.:0.11406 3rd Qu.:12.000 3rd Qu.:0.0000
Max. :0.13303 Max. :17.000 Max. :1.0000
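One difference between the two objects is that Seatbelts is a monthly time series, while the tibble has no column for the date. The following is a minimal sketch of one way to keep the time index as an ordinary column. The names df_sb_time and time are assumptions made here, not part of the original exercise.

```r
library(tibble)

# Seatbelts is a multivariate time series, not a data frame.
class(datasets::Seatbelts)

# Convert to a tibble, then add the time index as a numeric column
# (a decimal year for each monthly observation).
df_sb_time <- as_tibble(datasets::Seatbelts)
df_sb_time$time <- as.numeric(time(datasets::Seatbelts))

df_sb_time
```

With such a column, the series can be drawn against time, for example with aes(x = time, y = DriversKilled) and geom_line().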
Which visualization do you apply?
df_sb %>% ggplot(aes(x = factor(law), y = DriversKilled)) + geom_boxplot()

What do you observe above?
df_sb %>% ggplot(aes(x = PetrolPrice, y = DriversKilled)) + geom_point() +
  geom_smooth(formula = y~x, method = "lm", se = FALSE)

What can you see above?
df_sb %>% ggplot(aes(x = kms, y = DriversKilled)) + geom_point() +
  geom_smooth(formula = y~x, method = "lm", se = FALSE)

What can you see above?
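To support what you read from the two scatter plots, the straight lines drawn by geom_smooth(method = "lm") can also be fitted directly with lm(). This is only a sketch; the names sb, fit_price and fit_kms are chosen here for illustration.

```r
# Convert the time series to a plain data frame for modelling.
sb <- as.data.frame(datasets::Seatbelts)

# The same simple linear regressions that geom_smooth(method = "lm") draws.
fit_price <- lm(DriversKilled ~ PetrolPrice, data = sb)
fit_kms <- lm(DriversKilled ~ kms, data = sb)

# Intercept and slope of each fitted line.
coef(fit_price)
coef(fit_kms)

# Correlation coefficients as a rough measure of the strength of each relation.
cor(sb$DriversKilled, sb$PetrolPrice)
cor(sb$DriversKilled, sb$kms)
```

The sign of each slope tells you the direction of the trend line in the corresponding plot.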
We will learn how to use pivot_longer and
pivot_wider in EDA4.
df_sb %>%
  pivot_longer(cols = 2:4, names_to = "seat", values_to = "value")
df_sb %>%
  pivot_longer(cols = 2:4, names_to = "seat", values_to = "value") %>%
  ggplot() + geom_boxplot(aes(x = seat, y = value, fill = seat))

What can you observe?
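In the code above, cols = 2:4 selects the columns by position, which here are drivers, front and rear. As a sketch of an alternative, the same columns can be selected by name, which is easier to read and does not depend on column order. The names by_position and by_name are chosen only for illustration.

```r
library(tibble)
library(tidyr)

df_sb <- as_tibble(datasets::Seatbelts)

# Select the columns to stack by position ...
by_position <- pivot_longer(df_sb, cols = 2:4,
                            names_to = "seat", values_to = "value")

# ... or by name; both select drivers, front and rear.
by_name <- pivot_longer(df_sb, cols = c(drivers, front, rear),
                        names_to = "seat", values_to = "value")

# Each original row becomes three rows, one per stacked column.
nrow(df_sb)
nrow(by_name)
unique(by_name$seat)
```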
df %>% filter(continent == "Asia") %>%
  ggplot(aes(x = year, y = gdpPercap, color = country)) + geom_line()

Is this an appropriate graph?
ggplot(df, aes(x = continent)) + geom_bar()
ggplot(df, aes(x = continent, fill = continent)) + geom_bar(width = 0.75)
ggplot(df, aes(x = continent, y = pop)) + geom_boxplot()
ggplot(df, aes(x = continent, y = log10(pop))) + geom_boxplot()

Alternatively, you can use coord_trans(x = "identity", y = "log10") instead of y = log10(pop). Can you see the difference?
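A hint about the difference: with y = log10(pop) the boxplot statistics are computed on the log values, while coord_trans() computes them on the raw values and only then transforms the axis. The sketch below uses base R's boxplot.stats() as an approximation of what geom_boxplot() computes (an assumption: the two use slightly different quartile definitions) to show that the points flagged as outliers can differ. The names asia, out_raw and out_log are chosen only for illustration.

```r
library(gapminder)

asia <- subset(gapminder, continent == "Asia")

# Outliers when the statistics are computed on the raw scale
# (roughly the situation with coord_trans()).
out_raw <- boxplot.stats(asia$pop)$out

# Outliers when the statistics are computed on the log scale
# (roughly the situation with y = log10(pop)).
out_log <- boxplot.stats(log10(asia$pop))$out

length(out_raw)
length(out_log)
```

The axis labels also differ: log10(pop) labels the axis with exponents, while coord_trans() keeps the original population values.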
ggplot(df, aes(x = continent, y = pop)) + geom_boxplot() +
  coord_trans(x = "identity", y = "log10")
df %>% filter(year %in% c(1957, 1982, 2007)) %>%
  ggplot() + geom_boxplot(aes(x = continent, y = pop, fill = factor(year))) +
  coord_trans(x = "identity", y = "log10")
df %>% filter(year %in% c(1957, 1982, 2007)) %>%
  ggplot() + geom_boxplot(aes(x = continent, y = gdpPercap, fill = factor(year))) +
  coord_trans(x = "identity", y = "log10")
ggplot(df, aes(gdpPercap, lifeExp)) + geom_point(aes(color=continent))
df %>% ggplot(aes(gdpPercap, lifeExp, color = continent)) + geom_point() +
  coord_trans(x = "log10", y = "identity")
df %>% filter(year %in% c(2007)) %>%
  ggplot() + geom_point(aes(x = pop, y = gdpPercap, col = continent)) +
  coord_trans(x = "log10", y = "log10")
df %>% filter(year %in% c(1957)) %>%
  ggplot() + geom_point(aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
  coord_trans(x = "log10", y = "identity")
df %>% filter(year %in% c(2007)) %>%
  ggplot() + geom_point(aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
  coord_trans(x = "log10", y = "identity")
df %>% filter(year %in% c(1957, 2007)) %>%
  ggplot() + geom_point(aes(x = gdpPercap, y = lifeExp, col = continent, size = pop)) +
  coord_trans(x = "log10", y = "identity") +
  facet_wrap(vars(year))
df_lifeExp <- df %>% group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), median_lifeExp = median(lifeExp), max_lifeExp = max(lifeExp), min_lifeExp = min(lifeExp))
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.

The code above gives a message, but it works.
df_lifeExp %>% ggplot(aes(x = year, y = median_lifeExp, color = continent)) +
  geom_line()

If you do not want the message, the following is an option: set .groups = "drop" in summarize(). Otherwise the grouping by continent is kept, and you can remove it later with ungroup().
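To see what "grouping is kept" means, this sketch compares the grouping variables that remain after summarize() with and without .groups = "drop". The names kept and dropped are chosen only for illustration.

```r
library(dplyr)
library(gapminder)

# Without .groups, only the last grouping variable (year) is dropped,
# so the result is still grouped by continent (and a message is shown).
kept <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp))

# With .groups = "drop", the result is not grouped at all.
dropped <- gapminder %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), .groups = "drop")

group_vars(kept)           # "continent"
group_vars(dropped)        # character(0)
group_vars(ungroup(kept))  # character(0)
```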
df_pop <- df %>% group_by(continent, year) %>%
  summarize(mean_pop = mean(pop), median_pop = median(pop), max_pop = max(pop), min_pop = min(pop), .groups = "drop")
df_pop %>% ggplot(aes(x = year, y = mean_pop, color = continent)) +
  geom_line()

Compare the following two charts. The first maps both color and linetype to continent; the second draws mean_pop in color and median_pop by linetype, in separate layers.
df_pop %>% ggplot(aes(x = year, y = mean_pop, color = continent, linetype = continent)) +
  geom_line()
df_pop %>% ggplot() +
  geom_line(aes(x = year, y = mean_pop, color = continent)) +
  geom_line(aes(x = year, y = median_pop, linetype = continent))

Example: Smoking, Alcohol and (O)esophageal Cancer
(df_esoph <- as_tibble(esoph))

df_esoph has three categorical variables and one numerical variable, ncases, to investigate.
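As a quick check of this description, the sketch below lists the class of each column (the data also contain a second numeric column, ncontrols) and counts how many rows share each combination of agegp and tobgp. The name cell_counts is chosen only for illustration.

```r
library(dplyr)
library(tibble)

df_esoph <- as_tibble(datasets::esoph)

# Class of each column: the three group variables are ordered factors.
sapply(df_esoph, function(x) class(x)[1])

# Number of rows for each (agegp, tobgp) combination.
# Several rows share a combination because the data are also split by alcgp.
cell_counts <- count(df_esoph, agegp, tobgp, name = "n_rows")
cell_counts
```

Because several rows fall in the same agegp and tobgp cell, plots that place one mark per row at those positions will overlap unless ncases is summed first or the third variable is mapped to another aesthetic.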
Comments: I wanted to include three variables in the first exercise, so that I could compare tobacco consumption, number of cases of cancer, and age in the same graph, but I was not able to do it.
Solutions: There are various ways you can choose from.
Scatter plot with size by geom_point().
ggplot(df_esoph) + geom_point(aes(agegp, tobgp, size=ncases))

Heatmap with geom_tile()
ggplot(df_esoph) + geom_tile(mapping = aes(x = agegp, y = tobgp, fill = ncases))

geom_boxplot()
ggplot(df_esoph, aes(x= tobgp, y=ncases, fill=agegp))+geom_boxplot()
ggplot(df_esoph, aes(x= agegp, y=ncases, fill=tobgp))+geom_boxplot()

geom_col()
ggplot(df_esoph, aes(x= agegp, y=ncases, fill=tobgp))+geom_col()

Default position is “stack”.
ggplot(df_esoph, aes(x= agegp, y=ncases, fill=tobgp))+geom_col(position = "dodge")

df %>%
  select(country, continent, year, gdpPercap) %>%
  filter(continent %in% c("Asia", "Europe")) %>%
  group_by(continent, year) %>%
  summarise(mean_GDPperCapita = mean(gdpPercap)) %>%
  ggplot(aes(x=year)) +
  geom_line(aes(y=mean_GDPperCapita, color=continent)) +
  ggtitle("GDP per capita by continents, 1950's to today")
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
df %>%
  select(country, year, gdpPercap) %>%
  filter(country %in% c("Israel", "Japan", "Norway", "China", "Ireland")) %>%
  ggplot(aes(x=year)) +
  geom_line(aes(y=gdpPercap, color=country))

Question. I have not managed to add the data for the individual countries to the same graph as the continents, as I would have liked.
Solution. Construct two data sets and combine them into one, because a ggplot2 chart starts from a single data set.
df_2c <- df %>%
select(continent, year, gdpPercap) %>%
filter(continent %in% c("Asia", "Europe")) %>%
group_by(continent, year) %>%
summarise(gdpPercap = mean(gdpPercap), .groups = 'drop') %>%
select(country = continent, year, gdpPercap)
df_5c <- df %>%
select(country, year, gdpPercap) %>%
filter(country %in% c("Israel", "Japan", "Norway", "China", "Ireland"))
df_2c %>% bind_rows(df_5c) %>%
  ggplot(aes(x = year, y = gdpPercap, color = country)) + geom_line()

Use mutate.
df %>%
group_by(continent, year) %>%
mutate(mean_by_continent = mean(gdpPercap)) %>%
ungroup() %>%
filter(country %in% c("Israel", "Japan", "Norway", "China", "Ireland")) %>%
ggplot(aes(x = year)) +
geom_line(aes(y = gdpPercap, color=country)) +
geom_line(aes(y = mean_by_continent, linetype=continent)) +
  labs(title = "GDP per capita of five countries", subtitle = "Mean of GDP per capita of their continent")

When you want to change the linetype manually, use scale_linetype_manual().
df %>%
group_by(continent, year) %>%
mutate(mean_by_continent = mean(gdpPercap)) %>%
ungroup() %>%
filter(country %in% c("Israel", "Japan", "Norway", "China", "Ireland")) %>%
ggplot(aes(x = year)) +
geom_line(aes(y = gdpPercap, color=country)) +
geom_line(aes(y = mean_by_continent, linetype=continent)) +
scale_linetype_manual(values = c("Asia" = "twodash", "Europe" = "longdash")) +
  labs(title = "GDP per capita of five countries", subtitle = "Mean of GDP per capita of their continent")

Example: Default
df %>%
filter(country %in% c("Germany", "Japan", "United States")) %>%
ggplot() +
  geom_line(aes(x = year, y = gdpPercap, color=country, linetype=continent))

scale_color_manual: https://ggplot2-book.org/scale-colour.html
scale_fill_manual: similar to scale_color_manual
scale_linetype_manual: https://ggplot2-book.org/scale-other.html?q=linetype#scale-linetype

df %>%
filter(country %in% c("Germany", "Japan", "United States")) %>%
ggplot(aes(x = year, y = gdpPercap)) +
geom_line(aes(color=country, linetype=continent)) +
geom_point(aes(shape = country)) +
scale_colour_manual(values = scales::hue_pal(direction = -1)(3)) +
scale_linetype_manual(values = c("Europe" = "dotted", "Asia" = "dotdash", "Americas" = "longdash")) +
scale_shape_manual(values = c("Germany" = 7, "Japan" = 9, "United States" = 12))
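The named-vector style used above for linetype and shape can also be used for colors, which ties each color to a specific country regardless of ordering. The following is a minimal sketch; the hex codes and the names country_colours and p_colours are arbitrary choices made here for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Hypothetical color choices: any valid color names or hex codes work.
country_colours <- c("Germany" = "#1b9e77",
                     "Japan" = "#d95f02",
                     "United States" = "#7570b3")

p_colours <- gapminder %>%
  filter(country %in% names(country_colours)) %>%
  ggplot(aes(x = year, y = gdpPercap, color = country)) +
  geom_line() +
  # Each country gets the color stored under its name.
  scale_colour_manual(values = country_colours)

p_colours
```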